Hello and welcome. My name is Tom Ritter and I work for iSEC Partners. If you don't know
who Zax is, you will by the end of the talk. So this talk is about an anonymity network
that was started in the fledgling days of the cypherpunk era, the early 1990s. This
book, what many of you will probably call the Bible, had not even come out yet. But the
first edition had. And while you could export the book itself, the U.S. government had determined
you could not export the floppy disk that the code had come on. In fact, the U.S. was
actively investigating Phil Zimmermann for violating the Arms Export Control Act for
making the first few versions of PGP available. Dan Bernstein and the toddler-aged EFF went
on the offensive, taking the U.S. government to court and suing over the export controls
on crypto. And another group of people ultimately printed out the source code for PGP, exported
the book to Europe, and OCR'd it in '97, releasing a version of PGP that bypassed the export controls.
Alt.anonymous.messages was forged in the heyday of the cypherpunks and really overall
has changed very little in the intervening decade since it was last shaped in any major
way. But in that decade, what we have seen is a monumental focus of the nation's spy
agencies on not what was thought to be the most critical piece of information to encrypt,
the content itself, but rather on metadata. The people who know won't talk and the people
who talk don't know, but leaked court orders require Verizon to turn over call records
local and abroad. Now, I'm talking here so I don't know anything and I'm just speculating,
but the most straightforward thing to do with this data is to build communication graphs,
analyze the metadata, looking for patterns, identifying people of interest and figuring
out who they talk to. So consider SSL. From an information theoretic perspective, an adversary can see that you're
sending packets and communicating. That seems obvious. Of course they know that you're communicating.
But it's important to bear in mind for the future. Ideally, the adversary wouldn't even
know that you are communicating. Secondly, SSL makes no attempt at hiding who you're
talking to. So the fact that you're on Facebook, straightforward. And similarly, the adversary
knows when you're on Facebook and when you are sending data and when you are receiving
data, and the resolution on this goes down to the microsecond. So they know exactly when,
but they also know exactly how much data you receive. SSL doesn't have any real padding,
and I don't know of any website that adds variable length padding to frustrate length
analysis. So how many of you stayed through Runa's talk? A few. Thank you. So let's
talk about Tor. Tor is an implementation of onion routing,
where you pass messages along a chain, each node peeling off a layer of encryption until
an exit node talks to the intended destination. The destination responds and it's routed
back. Onion routing specifically aims to disguise who is talking. An adversary observing
you can't see that you're talking to a website or a service, and an adversary observing that
website or service can't see who is talking to it. But it doesn't stop an adversary from
knowing you're talking to someone, knowing when you're talking and how much you're saying.
Tor doesn't really do padding. What little it does is not intended to be a security feature.
Tor explicitly leaves out length padding. And if you stayed through Runa's talk, you
know that Tor cannot protect you if an adversary can see the entire path of a circuit. Let's
say, hypothetically speaking, that New Zealand, Australia, the U.S., Canada, and the U.K.
were to say conspire on some sort of spy program.
Well, if your circuit went through these countries, Tor can't help you, at least not information
theoretically. The adversary can track your traffic and find out who you're talking to.
I'm not saying this is actively happening. I'm saying we've proved in papers that it's
possible and that it's explicitly outside of Tor's threat model. And a slightly more
difficult version of that attack is if the adversary can see you and then see the last
leg of your path later on, like, say, you're in China visiting.
Well, they can do a similar attack and track you down. It requires a little bit more math,
a little bit more correlation, but, again, we've proved that it's possible and it is,
again, outside of Tor's threat model. And this is particularly concerning seeing as
I, like probably most of you, happen to live in the U.S. And so much of what we do happens
to be hosted in Amazon EC2 in Virginia. So if either of those two cases apply, we're
basically back at SSL.
And at this point, I think it's worthwhile to show a couple of attacks on metadata. So
IOActive built a proof of concept traffic analysis tool that looks at your SSL session
with Google and figures out what part of Google Maps you're actually looking at, all based
off the sizes of the tiles that you're downloading over SSL. And it's worthwhile to note that
this is an attack on a client, on someone browsing Google Maps at that moment. Let me
show an alternate example.
You're sitting on Facebook, with Facebook chat enabled, all over SSL. Heck, all over
Tor. Well, Facebook chat turns you into a server. You're able to receive messages from
people and they will be pushed down to you. The attacker, not you, determines when you
will receive a message, and that's a pretty powerful capability, and it can lead to time-based
correlation attacks. An adversary sends you a message, looks at all the people connected
to Facebook or Tor, and sees who receives a message right after that. And even easier, because Facebook chats tend
to be huge, it can lead to size-based correlation attacks. Not only do I send you a Facebook
chat, but I send you a huge Facebook chat. With only a couple of trials, you can be pretty
confident that the user whose Internet connection you're monitoring is the same anonymous Syrian
dissident that you're messaging on Facebook. And it's interesting to note that a very similar
attack was used to de-anonymize Jeremy Hammond, who is currently awaiting trial for allegedly
dumping Stratfor's mail spools. The police staked out his home, watched him enter, saw
some Tor traffic, and whoop, the username that they thought was him popped onto IRC.
Classic traffic confirmation attack. And I've gotten some comments that they also might
have cut his Internet connection and saw him drop off. I haven't been able to personally
confirm that in the police logs. I haven't had time. But if that's true, that's another
type of traffic confirmation attack.
That's on a low latency connection. Now, the good news is that even if the adversary
can see the start and end nodes or even the entire path, there is a way to disguise who
you're talking to. And that's mix networks. Mix networks introduce a delay while they
collect messages into a pool and then fire them all out. Collecting messages prevents
an adversary who's observing the mix from knowing what message went where. It introduces
uncertainty.
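As a rough illustration of the pooling just described (this is not code from the talk), here is a minimal Python sketch of a threshold mix. The class name `ThresholdMix`, the threshold of three messages, and the flush-when-full policy are all assumptions chosen for the example; real remailers use more elaborate pool strategies and also strip a layer of encryption, which this sketch leaves out.

```python
import random


class ThresholdMix:
    """Toy threshold mix: hold messages until the pool is full, then flush them shuffled."""

    def __init__(self, threshold=3, rng=None):
        self.threshold = threshold                 # assumed batch size
        self.pool = []
        self.rng = rng or random.SystemRandom()    # OS-backed randomness for the shuffle

    def receive(self, message):
        """Add a message to the pool; return the flushed batch, or [] while still collecting."""
        self.pool.append(message)
        if len(self.pool) < self.threshold:
            return []
        batch, self.pool = self.pool, []
        self.rng.shuffle(batch)                    # output order is independent of arrival order
        return batch


mix = ThresholdMix(threshold=3)
outputs = [mix.receive(m) for m in ("msg-a", "msg-b", "msg-c")]
```

Nothing leaves the mix until the third message arrives, and the batch then goes out in shuffled order, so an observer sees three in and three out without a mapping between them.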
And I really like mix networks and I want to encourage the research and adoption. So
I actually want to take a quick moment to demonstrate it to you live on stage. So right
now I'm going to be a Tor node or an onion routing node or a low-latency anonymity network
and I'm going to receive a packet and then send it right out. Now I'm going to play
a mix node or a remailer node. And I'm going to collect a packet and stick it in my
bag, collect another packet and stick it in my bag, and collect another packet and
stick it in my bag. I'm going to shuffle these
up. I'm going to peel off the outer layer of encryption and now I'm going to send them
out all at once. So you, the global passive adversary who can observe my computer and
see all the traffic I send and receive, you saw that I received three messages and you
saw that I sent out three messages, but you don't know which message went where. That's
the uncertainty. So, mix networks demonstrated, we've gained
back a certain amount of protection against figuring out who was communicating with whom.
Given enough time or low enough traffic volume, an adversary can perform the same types of
attacks I described against Tor, correlating messages, but it takes a lot more observation.
The easiest thing to learn that takes no time or analysis is the fact that I'm communicating.
We don't disguise the if. We don't disguise the when. And we also don't
disguise how large it is. So enter shared mailboxes
and alt.anonymous.messages. That's a bit of a mouthful. I'm going to abbreviate alt.anonymous.messages
to AAM. So a shared mailbox is what it sounds like. Imagine an e‑mail account where everyone
in the room has the user name and password, but it's read‑only access. You can't delete
messages. You can't send them. All the messages are encrypted, so what you do, as one
of the people with access to this inbox, is download them all, and then you try and
decrypt each one of them with your private key. And the ones that you can decrypt are
to you, and the ones that you can't decrypt aren't. And you don't know who they're to.
Well, someone watching this encrypted connection, watching you accessing this mailbox and downloading
all the messages, they can see that you're accessing the mailbox. That's certain. And
they know that you downloaded all the messages. But they don't know if you were able to decrypt
any of them. And because of that, they don't know when you received a message, who it was
from, or how large it was. All they know is that you're checking the mailbox, not that
you're actually getting mail. At the cost of a lot of bandwidth, receiving messages
via a shared mailbox provides an awful lot of security comparatively.
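The download-everything, try-every-message loop can be sketched in Python as follows. This is an illustration, not the tooling used with AAM. To stay within the standard library it swaps PGP public-key encryption for a toy symmetric scheme (a SHA-256 counter keystream plus an HMAC tag); the functions `seal`, `try_open`, and `check_mailbox` are hypothetical names, and the toy cipher should not be used for real secrecy.

```python
import hashlib
import hmac
import os


def _keystream(key, nonce, length):
    """Toy keystream: SHA-256 over key, nonce, and a block counter."""
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + counter.to_bytes(4, "big")).digest()
        counter += 1
    return out[:length]


def seal(key, plaintext):
    """Encrypt and tag a message (toy stand-in for encrypting to a recipient's key)."""
    nonce = os.urandom(16)
    body = bytes(a ^ b for a, b in zip(plaintext, _keystream(key, nonce, len(plaintext))))
    tag = hmac.new(key, nonce + body, hashlib.sha256).digest()
    return nonce + tag + body


def try_open(key, blob):
    """Return the plaintext if this key opens the blob, otherwise None."""
    nonce, tag, body = blob[:16], blob[16:48], blob[48:]
    expected = hmac.new(key, nonce + body, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        return None
    return bytes(a ^ b for a, b in zip(body, _keystream(key, nonce, len(body))))


def check_mailbox(my_key, mailbox):
    """Fetch every message and keep only the ones that decrypt with my key."""
    opened = (try_open(my_key, blob) for blob in mailbox)
    return [plaintext for plaintext in opened if plaintext is not None]


alice, bob = os.urandom(32), os.urandom(32)
mailbox = [seal(alice, b"hi alice"), seal(bob, b"hi bob"), seal(alice, b"again")]
mine = check_mailbox(alice, mailbox)
```

Every reader performs the same full download, so the traffic looks identical whether zero, one, or all of the messages were addressed to them.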
Now, shared mailboxes are an awesome anonymity tool, but the difference between an awesome
anonymity tool and an anonymity tool that's actually used is the answer to the question,
can I interact with the rest of the world? Tor is wildly successful compared to any other
anonymity system because you can browse the actual Internet with it. It's not a closed
system where you only interact with hidden services. So for a shared mailbox to actually
be used, it needs to interact with normal e-mail. And that's where nymservs come in.
The simplest nymserv, and the newest and easiest to use, receives a message at a domain name
and then just posts it immediately to alt.anonymous.messages. This is a nymserv written by Zax and it's
on GitHub. The much more complicated type 1 or Ghio nymservs can forward the mail to
another e-mail address or directly to alt.anonymous.messages, or they can even route it through a
remailer network to eventually wind up in one of those two places, and I'll talk more
about this nymserv later on. So if we add nymservs to send mail, shared
mailboxes have awesome anonymity for the recipient, and when you send the message to a nym
that uses a shared mailbox, you're ideally using an onion router or a mix network,
although you don't have to, and thus you would have those security properties. An adversary
can see that you're sending, when you send it, and how large it is.
So now that I've walked through the security properties of the different types of anonymity
networks, let's actually dive into AAM. It should really have strong security. After
all, it's the most theoretically secure. But if you've never
looked at it before, this is what it looks like, at least in Google Groups. It's Usenet.
How many people are old enough to have used Usenet? All right. Good, good. So
this is what it looks like today. A whole bunch of people are using
hexadecimal subjects, all posted by anonymous or nobody. And any individual message usually
looks like a PGP message that may or may not have a version string.
Today there are about 190 messages posted per day. But what's interesting is that while
the average has certainly decreased over the last decade, it's held somewhat steady in
the last five years. So the data set that I worked off of was about 1.1 million messages
from the last ten years.
Now, we can really see some shortcomings here already. Over half of the messages in my data
set go through two people. The network diversity is horrible. And if you stayed through Runa's talk,
you know that's kind of important. If either one of these folks, Zax or Dizum,
got subpoenaed, shut down, or just retired, the whole network would be thrown into disarray.
And to the person who asked about directory authorities in Tor, Dizum is one of the directory
authorities in Tor.
And Dizum is not affiliated with the Tor Project. He's just someone they trust.
Now, this looks pretty bad. It's way worse.
That 53.5% statistic was over the entire data set. Today Zax and Dizum make up virtually
all of the messages posted to AAM. I don't mean that they're sending them all. I mean
that they are the exit node for all the messages posted to AAM. And that dip, that weird dip?
That was 7,800 messages sent through Frel, which operates a remailer and a news gateway.
It had a unique subject.
It didn't have any unique headers.
I couldn't get a whole lot out of it aside from correlating those 7,800 messages uniquely.
So with network diversity pretty clearly abolished, let's take a look at the data and see what type of analysis we can actually do.
I don't think I can say anything as ironic as this quote:
"Keeping the ciphertext around in public for a shorter time sounds like a good thing anyway."
And that's from 1994.
So here we are just shy of 20 years later.
And the first thing to do is break it up by PGP versus not PGP.
And you can see it's overwhelmingly PGP messages.
But what are the not PGP messages real quickly?
I was trying to come up with a nice way to say crackpots.
I'm not sure if I succeeded.
But there are several people who have and continue to post just random rants about...
I'm not even really sure.
Some of them are definitely the lizard people.
And there are actually frequently asked questions that have sprung up in response to these guys
because people were just getting flat out confused by them.
And besides those, there are some other non-PGP messages.
I think the most interesting is a set of about 10,000 messages with the subject,
Operation Satanic, or Satanic Operation.
What's interesting about these messages is that they're clearly ciphertext, but it's alphabetic.
If you look at a single message, you might think that it's like a Caesar cipher or a Vigenère
or some sort of polyalphabetic thing.
But if you look at them in whole, you see that it's a perfectly even distribution over a 16-letter alphabet.
In other words, I think it's a substitution cipher into hexadecimal and that it's actually ciphertext.
There are other message clumps that are similar to this.
So if you're into this sort of...
you know, analysis, have at it.
And the next thing to look at is what percent of messages were delivered to AAM via a nymserv or via a remailer.
Now these numbers are going to be a little bit off since some of the PGP or remailer messages are actually to nyms
and some of the PGP messages may be through remailers I don't know about, but it's something.
And we can see that a large portion are messages to nyms,
which will be important when I tell you about how many nymservs are actually still running.
Okay, so those somewhat interesting statistics aside, let's start diving into all of those hundreds of thousands of encrypted messages.
So if you didn't know, OpenPGP consists of packets, and each packet type just does something slightly different.
There's a packet type for a message encrypted to a public key and a packet type encrypted to a password.
So what are these packet types?
Well, these graphs show the popularity of each packet type.
For example, packet type 1 followed by packet type 9.
And the top five, the ones on the bottom, are the ones that you'd expect to see.
Packet type 1 is messages encrypted to a public key.
Packet type 3 is messages encrypted to a passphrase.
The actual ciphertext of a message is 9 or 18 for old style or new style,
and I separated out the messages to a single public key versus messages to multiple public keys.
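To make the packet structure concrete, here is a minimal Python sketch that walks the packet headers of binary OpenPGP data and reports each packet's tag and body length, following the old-format and new-format header layouts of RFC 4880. It is an illustration rather than the analysis code behind the graphs. It assumes the ASCII armor has already been removed, it does not handle partial body lengths, and `parse_packets` and `TAG_NAMES` are names invented for the example.

```python
TAG_NAMES = {
    1: "public-key encrypted session key",
    3: "symmetric-key encrypted session key",
    8: "compressed data",
    9: "symmetrically encrypted data",
    11: "literal data",
    18: "symmetrically encrypted, integrity-protected data",
}


def parse_packets(data):
    """Yield (tag, body_length) for each packet in de-armored OpenPGP data."""
    i = 0
    while i < len(data):
        first = data[i]
        if not first & 0x80:
            raise ValueError("bit 7 of a packet header must be set")
        i += 1
        if first & 0x40:                          # new-format header
            tag = first & 0x3F
            o1 = data[i]
            if o1 < 192:                          # one-octet length
                length, i = o1, i + 1
            elif o1 < 224:                        # two-octet length
                length, i = ((o1 - 192) << 8) + data[i + 1] + 192, i + 2
            elif o1 == 255:                       # five-octet length
                length, i = int.from_bytes(data[i + 1:i + 5], "big"), i + 5
            else:                                 # partial body length: not handled here
                yield tag, None
                return
        else:                                     # old-format header
            tag = (first >> 2) & 0x0F
            length_type = first & 0x03
            if length_type == 3:                  # indeterminate: body runs to the end
                yield tag, len(data) - i
                return
            n = (1, 2, 4)[length_type]
            length, i = int.from_bytes(data[i:i + n], "big"), i + n
        yield tag, length
        i += length


# Old-format tag 1 (2-byte body) followed by old-format tag 9 of indeterminate length.
old_style = bytes([0x84, 2, 0xAA, 0xBB, 0xA7, 1, 2, 3])
# New-format tag 18 with a 3-byte body.
new_style = bytes([0xD2, 3, 0, 0, 0])
old_tags = list(parse_packets(old_style))
new_tags = list(parse_packets(new_style))
```

The sequence of tags per message (for example 1 then 9, or a bare 9) is the kind of signature the popularity graphs count.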
Now, there are two that are just kind of weird.
These are the packet types you'd expect to see after you decrypted a message.
These are plain text packets.
There are actually a small number of messages that look like OpenPGP data.
They've got the whole BEGIN PGP MESSAGE header and they're Base64, but they're actually just plain text sitting in plain sight.
And if we look at packet type 8, this is what we get.
It really is just compressed plain text data.
Unfortunately, it's also nonsense.
I don't know if there's a code there or not.
I didn't spend a whole lot of time on it after I looked at it.
I ran Organizing Bizarre Sabbatical.
It probably came out of some Markov generator somewhere, so I kind of moved on.
And what I moved on to was messages that were sent to public keys.
Now, it's super obvious to do analysis based on the public key that's in the message.
I promise you it gets a little bit more complicated later, but let's look at the key IDs.
Honestly, they're a pretty powerful segmenting tool.
I wanted to illustrate a couple of examples where key IDs can tell us more.
There was one key ID, and I've anonymized most of the specific data in this because
de-anonymizing people kind of isn't cool.
So there was one key ID that messaged very reliably through a nymserv, except for
two messages sent through Easynews.
And if you track down that very unique Easynews gateway and the user agent, well, we
find out that person also sent messages to another key ID, and we can start making inferences
across multiple types of metadata.
Now, I mentioned that I separated out the messages that were sent to multiple public
keys versus the ones sent to a single one.
If a message was sent to a single key, we don't know too much about it, especially because
they usually throw the key ID, so it's just all zeros.
But if a message is sent to more than one key, then we can draw communication graphs.
Now, it's not a strict communication graph in the sense that a message was sent from
Alice to Bob.
Technically, it's that Alice and Bob received the same message.
But in most situations, people will encrypt a message to themselves so they can read their
own sent mail.
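A co-recipient graph of this kind can be built with a few lines of Python. This is a sketch under assumptions: each message is represented as a plain list of recipient key IDs, the all-zero ID stands for a thrown key ID and is skipped, and `co_recipient_graph` is a name made up for the example. An edge means only that two keys received the same message, which is the weaker statement noted above.

```python
from collections import Counter
from itertools import combinations

ZERO_KEY_ID = "0000000000000000"   # placeholder left behind when the key ID is thrown


def co_recipient_graph(messages):
    """Count, for each pair of key IDs, how many messages were encrypted to both."""
    edges = Counter()
    for key_ids in messages:
        known = sorted({k for k in key_ids if k != ZERO_KEY_ID})
        for a, b in combinations(known, 2):
            edges[(a, b)] += 1        # undirected edge; weight = number of shared messages
    return edges


sample = [
    ["A", "B"],
    ["B", "A"],
    ["A", "B", "C"],
    [ZERO_KEY_ID],                    # single recipient, key ID hidden: no edge
    ["A"],                            # single recipient: no edge
]
edges = co_recipient_graph(sample)
```

The edge weights correspond to the line widths in the pictures; per-key message counts, which set the circle sizes, would be a separate tally.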
So I started drawing these pictures about the same time
as the PRISM scandal started breaking, so I was feeling really uncomfortable that this
is probably what the NSA is doing to me and my friends.
But nonetheless, quick reference, green means that I was able to get the public key off
of a key server.
A circle means that a key received messages to it individually as well as to, like, it
and multiple other people.
And then the size of the circle and the width of the line is how many messages they received.
So there's this very nice symmetrical five‑person graph, and we've got these much larger communication
networks here, and a real big one here, and we've got a couple of interesting graphs with
central communication points.
You can kind of infer from that what you want.
And then we've got a couple of more interesting networks, and I think these are interesting
because they imply that not everybody knows everybody else.
This graph and the next one may really be a model of the actual Internet, where people will
e‑mail people in a complex interconnected but not fully connected way.
This is a fairly low‑volume network, and this one has quite a few higher‑volume folks
participating.
And then there's, like, the rest, the simple two‑person communications going on.
But let's talk about brute‑forcing ciphertext.
Now if you'll recall this graph, you saw that packet type 9 was by far the most common packet
type found.
There's over 700,000 of them.
Now this packet type is really interesting, so let's dive in a little bit into the OpenPGP
spec.
This packet is the actual ciphertext of the message.
It is only the encrypted data.
It doesn't say what algorithm it is, and it doesn't explain how to get the key.
So where's the key?
The key is in another packet.
It's in packet type 1 for public keys or packet type 3 for pass phrases.
But if you'll recall from that graph, there aren't any packets that precede packet type
9.
We've got a disconnect from what the spec says and the data that we actually see.
Until we find this.
The IDEA algorithm is used with the session key calculated as the MD5 hash of the password.
Yeah, the MD5 of the password.
This is absolutely legacy and we've had better ways of doing this in OpenPGP since the late
90s.
So while in the very beginning of AAM this might have been excusable, the fact that my
data set was from 2003 onward makes this a pretty horrible situation.
And we know how to do MD5s right now.
Really, really fast.
But that's only half of it.
We also have to do an IDEA decryption.
And then we have to detect if what we decrypted was the actual plain text or just random.
And while you can run randomness tests, they're slow and we're brute forcing here.
So we want to go as fast as possible.
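The cheap parts of that loop can be sketched in Python. The key derivation is exactly what the quoted spec text says: the MD5 of the passphrase is the 128-bit IDEA key. The plaintext test shown here is an assumption, not necessarily the check the GPU code used: OpenPGP prefixes the data with one block of random bytes followed by a repeat of that block's last two bytes, so a wrong key passes only about one time in 65,536. IDEA is not in the Python standard library, so `decrypt_prefix` is a hypothetical callable supplied by the caller.

```python
import hashlib

IDEA_BLOCK = 8   # IDEA has a 64-bit block


def legacy_session_key(passphrase: bytes) -> bytes:
    """Legacy derivation: the 16-byte MD5 digest is used directly as the IDEA key."""
    return hashlib.md5(passphrase).digest()


def passes_quick_check(decrypted_prefix: bytes) -> bool:
    """True if bytes 7-8 of the random prefix are repeated as bytes 9-10."""
    if len(decrypted_prefix) < IDEA_BLOCK + 2:
        return False
    return (decrypted_prefix[IDEA_BLOCK - 2:IDEA_BLOCK]
            == decrypted_prefix[IDEA_BLOCK:IDEA_BLOCK + 2])


def crack(ciphertext, wordlist, decrypt_prefix):
    """Yield candidate passphrases whose decryption passes the quick check.

    decrypt_prefix(key, ciphertext) is a hypothetical function that returns
    at least the first IDEA_BLOCK + 2 decrypted bytes.
    """
    for word in wordlist:
        key = legacy_session_key(word)
        if passes_quick_check(decrypt_prefix(key, ciphertext)):
            yield word


key = legacy_session_key(b"password")
```

Survivors of the quick check are only candidates; each would still need a fuller look at the decrypted data before counting as a crack.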
This is all my way of trying to justify that I spent a lot of time writing GPU powered
code and running it for months and killing my home desktop.
But I did get results out of all this GPU cracking.
And in fact, among the first few dozen messages that I got was this one, which did not make me
feel terribly good about myself.
But I kept going.
And I got some HTML pages.
I got some weird SMTP logs.
I got a lot of partial remailer messages.
But overwhelmingly, what I got after I decrypted the message was an encrypted message:
recursively encrypted PGP messages.
And in fact, here's a breakdown of how many recursions I hit.
I got about 10,000 decryptions into a public key message.
And another 2200 that went into another password protected message.
So I went and cracked those.
And I got about 49 messages that were two layers deep.
And then I had cracked some more of those.
And I went four layers deep.
And then there was this one bloody message that was four layers deep that I still couldn't
crack.
So it's pretty damn recursive.
Now for the number of messages I was trying to brute force, something like 700 or 800,000,
the fact that I only got about 10,000 cracked is not really great.
You know, password crackers would consider that an abysmal failure.
I'm not the best cracker.
I'm sure people can do better.
But what I do want to defend myself with is that I'm not trying to crack passwords; I'm trying to crack encryption pass phrases
of the most paranoid people on the Internet.
So I think I did decent.
Now I haven't explained why there are so many recursively encrypted messages.
Like what the hell?
And to explain that, I have to talk about remailers.
How many people have ever used a remailer?
All right.
So about like two dozen.
So the tools that you've probably used, Mixmaster and Mixminion, are different.
They're dubbed type 2 and type 3 remailers.
That means there must be a type 1 remailer somewhere, right?
Well, they're basically dead, but the protocol itself lives on in Mixmaster.
And boy, what a protocol.
This is a manual of how to use most, but not even all, of the options supported by type
1 remailers.
Now some of the directives are on the left.
Now what's the difference between remail to, remix to, anon to, and encrypt to?
I don't know.
I sure as heck don't remember it.
I studied this stuff for a while.
So to use type 1, you actually have to type all of these out yourself.
It's not like a GUI where you just click a check box.
Now I had talked in the beginning about type 1 nymservs.
Well, type 1 nymservs are the main recipients of these directives.
You would string together a mix network chain of directives encrypted to different
nodes.
You'd type that all out yourself, by the way.
And that would be your reply block.
And when someone e-mails your nym,
the nymserv would basically execute your reply block, sending the message off through
each of the steps, ultimately coming out to your real e-mail address or to alt.anonymous.messages.
And we're still seeing these messages posted.
But there are only two type 1 nymservs operating.
One is Zax's, of course.
The other is Paranoici.
Paranoici is run by a group of Italian hackers in Milan.
They run Autistici and Inventati, which you can kind of think of as an Italian version of
Riseup.
If you've ever heard of Riseup.
So in conclusion, what are those nested PGP messages?
They're type 1 nymserv messages, where the key ID is the ultimate nym owner.
If I don't have a key ID, then there's another layer of symmetric encryption which I haven't
cracked yet.
And when you download type 1 nymserv messages, you know all of the passwords, you peel them
off one by one, and finally you use your private key.
And these are all the recipients with more than five messages.
It's pretty top heavy towards just a few nyms.
So communication graphs and brute forcing is really just the first quarter, I would
say, of the analysis that I did on AAM.
A majority of my time was spent doing correlation.
So even if I don't know who a message is to or what it says, it's valuable to know that
it's to the same person as another message or that it was sent by the same sender.
And why is that valuable?
Well, let's go back to this slide.
You can't tell if someone has even received a message in a shared mailbox.
But if I can correlate one message with another, then I can start determining that some unknown
person has received a message.
And once I know that two messages are related, well, then I can start paying attention to
their timestamp and to the length.
And this goes even further.
Because people tend to respond to messages that they receive.
And since I know if someone has sent a message,
it might just be that they are replying to a message that they just received.
So let's talk more about correlation and some more analysis of what's going on in AAM.
First off, it's obvious that you can correlate messages that use a single constant subject.
But there are a lot of messages like these.
Like nearly half of all the messages posted to AAM have a constant, like, English subject.
They don't use that hexadecimal stuff.
They do tend to be the older messages.
And they've tapered off recently, which makes sense.
But you can look at kind of these numbers, 22,000 messages in a cluster, 18,000 messages
in a cluster.
But let's talk about those random hexadecimal subjects.
Now, there are two algorithms to generate these subjects.
They're called encrypted subjects, or E subs, and hash subjects, or H subs.
And the point of these is to quickly identify which messages are for you and which messages
you should ignore.
For the folks who used Usenet, this is like how you can just download the headers
and not the whole bodies.
Now, personally, I think we're at the point where we could probably cut this step out.
But nonetheless, it's still there.
So let's break it.
E subs have two secrets, a subject and a password.
H subs have a single secret, a password.
It's considerably more difficult to brute force the E subs, and I ran out of time,
so I just focused on the H subs.
H subs were created by Zax.
And as his services are used more and more frequently,
they make up an increasing percentage of the subjects.
Now H subs have a random piece in them that you can kind of think of as an initialization
vector, as a salt.
And while I could try and shoehorn these into the existing SHA-256 crackers, it'd
be painful, you'd have to truncate the output.
So I just wrote my own GPU cracker again.
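A slow CPU version of that idea can be sketched in Python. The construction below matches the description above, a random salt-like piece plus a truncated SHA-256, but the exact parameters are assumptions for illustration: an 8-byte random IV, SHA-256 over the IV followed by the password, and the hex of IV plus digest truncated to 48 characters. The function names are hypothetical.

```python
import hashlib
import os

IV_BYTES = 8        # assumed IV size
HSUB_HEX_LEN = 48   # assumed subject length in hex characters


def make_hsub(password: bytes, iv=None) -> str:
    """Build a hashed subject: hex(IV || SHA-256(IV || password)), truncated."""
    iv = iv if iv is not None else os.urandom(IV_BYTES)
    digest = hashlib.sha256(iv + password).digest()
    return (iv + digest).hex()[:HSUB_HEX_LEN]


def matches_hsub(subject: str, password: bytes) -> bool:
    """Recompute the subject from its embedded IV and compare."""
    if len(subject) != HSUB_HEX_LEN:
        return False
    try:
        iv = bytes.fromhex(subject[:IV_BYTES * 2])
    except ValueError:                 # not hexadecimal, so not an H sub
        return False
    return make_hsub(password, iv) == subject.lower()


def crack_hsub(subject: str, wordlist):
    """Return the first candidate password that reproduces the subject, or None."""
    return next((word for word in wordlist if matches_hsub(subject, word)), None)


subject = make_hsub(b"DangerWillRobinson")
found = crack_hsub(subject, [b"guess1", b"DangerWillRobinson", b"guess2"])
```

Because the IV sits in the clear at the front of the subject, each guess costs a single SHA-256, which is why a fast custom cracker is practical; the truncation is what makes off-the-shelf SHA-256 crackers an awkward fit.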
And I cracked about 3,500 H subs.
Better than the percentage of messages I brute forced, but again, not a great percentage.
But again, these are the passwords of the most paranoid people on the Internet.
And I found an interesting set of messages with the H sub DangerWillRobinson, which was
used by some, but not all of the messages that were sent to a couple of particular key
IDs.
I cracked all the H subs of another key ID with the passwords of testicular and panties.
And if you don't know what shmegma is, don't Urban Dictionary it.
So if H subs and E subs are used to let a nym owner identify their own messages, you
can do something similar.
Let's say we want to target the nym Bob.
What we can do is send a particularly large message to Bob full of nonsense.
And then we wait for a large message to pop out in AAM.
Zax's nymserv is near instantaneous, so this size-based correlation is pretty easy.
Type 1 nymservs are not necessarily instantaneous, so it's a little bit more difficult, but not too difficult.
You can just do it a couple of times.
And this works.
And it works pretty easily and effectively.
What we get is a specific message that we know is to a particular nym, and at that point
we can target that for H sub cracking.
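The matching step of this size-based probe can be sketched in Python. It is an illustration only: posts are assumed to be available as `(message_id, posted_at, size_bytes)` tuples, and the ten-minute window and 50% size slack (to allow for encryption and armor overhead) are guesses, not measured values.

```python
def candidates_after_probe(posts, sent_at, probe_size, window=600.0, slack=0.5):
    """Return IDs of posts that appear soon after the probe and are about its size.

    posts: iterable of (message_id, posted_at, size_bytes) tuples.
    window: how many seconds after sending to keep looking (assumed).
    slack: allowed growth over the probe size, as a fraction (assumed).
    """
    smallest = probe_size
    largest = probe_size * (1.0 + slack)
    return [
        message_id
        for message_id, posted_at, size in posts
        if sent_at <= posted_at <= sent_at + window and smallest <= size <= largest
    ]


posts = [
    ("m1", 100.0, 2_000),        # posted before the probe was sent
    ("m2", 130.0, 260_000),      # large and right after the probe
    ("m3", 5_000.0, 255_000),    # large but far too late
    ("m4", 140.0, 3_500),        # on time but ordinary size
]
hits = candidates_after_probe(posts, sent_at=120.0, probe_size=200_000)
```

For a slower type 1 nymserv the window would need to be wider, and repeating the probe a few times narrows down any ambiguous matches.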
So I'm not done.
But unlike everything I presented before, what I'm going to talk about now is probability-based
attacks.
That is, I come up with a hypothesis that I can correlate messages with a probability
better than random if I look at property X, whatever X is.
Well.
How many of you like the scientific method?
I don't really have controls.
So what I'm doing is I'm coming up with a hypothesis and running it across the data
set.
And then I'm looking at the clusters of messages that pop out, and then I'm going to see if
I can figure out something else that correlates them.
And if I can see something else that correlates them, I call it a success.
That's how I kind of simulate controls.
So let's say I think if a message has a header value of X, I think that's a unique sender.
One sender is sending that value of X.
So I run that analysis, and I get clusters of messages encrypted to a single public key.
Well, if there was no correlation at all, I would probably get a distribution that looks
more random.
It would be encrypted to random public keys.
But with such a nicely segmented public key, I kind of think that this worked.
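One way to put a number on "nicely segmented" is sketched below in Python. The approach is an assumption for illustration, not the method used in the talk: group messages by the value of one header and report, for each value, how large the cluster is and what share of it goes to the single most common key ID. The field names `headers` and `key_id` are hypothetical.

```python
from collections import Counter, defaultdict


def key_concentration(messages, header):
    """Map each value of `header` to (cluster_size, share_of_most_common_key_id)."""
    clusters = defaultdict(Counter)
    for msg in messages:
        value = msg["headers"].get(header)
        if value is not None:
            clusters[value][msg["key_id"]] += 1
    report = {}
    for value, keys in clusters.items():
        total = sum(keys.values())
        top_count = keys.most_common(1)[0][1]
        report[value] = (total, top_count / total)   # 1.0 means every message went to one key
    return report


sample = [
    {"headers": {"X-Example": "odd-value"}, "key_id": "KEY1"},
    {"headers": {"X-Example": "odd-value"}, "key_id": "KEY1"},
    {"headers": {"X-Example": "odd-value"}, "key_id": "KEY1"},
    {"headers": {"X-Example": "common"}, "key_id": "KEY1"},
    {"headers": {"X-Example": "common"}, "key_id": "KEY2"},
    {"headers": {}, "key_id": "KEY3"},
]
report = key_concentration(sample, "X-Example")
```

A share near 1.0 on a reasonably large cluster supports the hypothesis that the header value marks one sender; a share close to the overall key distribution suggests the header says nothing.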
And even if I could have found that cluster by just looking at the public keys, the data
implies that I could use that trick, that is, that hypothesis, to find a cluster of
data when there is no other distinguishing characteristic.
So that's how I try and preserve some semblance of the scientific method.
So my first example is message headers.
That's a pretty big one.
Let's look at these.
Now, there are a few headers that are in nearly every message, but a long tail of headers
that are in only a few.
But these mostly unique message headers are not necessarily the gold mine that you might
think they are, and that's because headers can be added at the client, at the exit remailer,
at the mail-to-news gateway, or by the Usenet peer.
So what we have to do to really go after the distinguishing headers is subtract out
the headers that were added by all the other parts of the path, which we can do by just
clustering by the exit remailer and then seeing which headers are on all of those messages
and kind of subtract those out.
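That subtraction can be sketched in Python as follows. It is a minimal illustration: each message is assumed to be a dict with hypothetical fields `id`, `exit_remailer`, and `headers` (a set of header names), and only header names are compared, not values.

```python
from collections import defaultdict


def residual_headers(messages):
    """Map message id to the header names left after removing path-added headers."""
    clusters = defaultdict(list)
    for msg in messages:
        clusters[msg["exit_remailer"]].append(msg)

    residual = {}
    for group in clusters.values():
        # Headers present on every message from one exit were probably added
        # by the remailer, gateway, or Usenet peer rather than by the client.
        common = set.intersection(*(set(m["headers"]) for m in group))
        for m in group:
            residual[m["id"]] = set(m["headers"]) - common
    return residual


sample = [
    {"id": 1, "exit_remailer": "exit-a", "headers": {"From", "Path", "X-No-Archive"}},
    {"id": 2, "exit_remailer": "exit-a", "headers": {"From", "Path"}},
    {"id": 3, "exit_remailer": "exit-b", "headers": {"From", "Path", "User-Agent"}},
    {"id": 4, "exit_remailer": "exit-b", "headers": {"From", "Path", "X-Custom"}},
]
leftovers = residual_headers(sample)
```

One caveat of this simple version is that an exit seen only once has all of its headers treated as common, so small clusters deserve a manual look.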
And here are some great examples of headers that were specified by the client.
So User-Agent, obviously, X post type ID, X-No-Archive.
If you've used Usenet, you know that X-No-Archive is a client preference.
Now these three particular strange headers all formed a distinct clump of messages with
the unique subject of weed will save the planet.
And that's an easy example of how the idea of unique message headers can kind of correlate.
Now, X-No-Archive, this means don't save it in Usenet.
It's a client request that most Usenet servers will obey.
It's also not the word that I have on the screen.
This is a misspelling of the header.
And there is one person, or at least I'm claiming one person, who has messed this up
and completely distinguishes their messages from everyone else's.
All 17,300 of them.
So this is what you want, right?
No.
Capitalization matters.
And this is not the correct capitalization.
What's interesting about this one is that it shows up on several long‑running threads
on AAM, composing nearly 28,000 messages.
And initially, I thought each of these threads was relatively independent of each other,
but after finding this little bit of information, I'm starting to seriously doubt that.
This one isn't right either.
There's 1,500 messages posted with this header.
Including some test messages that were posted with someone's real name.
This is actually the correct version, and there's about 135,000 messages that have it,
or a little more than 10%, which makes it distinguishing in and of itself.
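One way to surface spelling and capitalization variants like these is to collapse header names to a loose canonical form and count the exact spellings under each. The Python sketch below is only illustrative: the variant spellings in the sample are hypothetical (the actual misspellings were on the slides), and this normalization catches case and punctuation differences but not transposed or missing letters.

```python
import re
from collections import Counter, defaultdict

def spelling_variants(header_names):
    """Map a loose canonical form of each header name to a count of the
    exact spellings seen, keeping only names with more than one spelling."""
    variants = defaultdict(Counter)
    for name in header_names:
        # Lowercase and drop punctuation so "X-No-Archive" and
        # "X-no-archive" land in the same bucket.
        canonical = re.sub(r"[^a-z0-9]", "", name.lower())
        variants[canonical][name] += 1
    return {k: dict(v) for k, v in variants.items() if len(v) > 1}

# Hypothetical sample data, for illustration only.
seen = ["X-No-Archive", "X-No-Archive", "X-no-archive",
        "X-No-Archive", "Subject", "X-NoArchive"]
variants = spelling_variants(seen)
```

Each rare exact spelling under a common canonical name is a candidate fingerprint for one client or one user.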
So just out of curiosity, another show of hands, has anyone ever used a type 1 nymserv?
I don't see any hands.
Okay.
So Encrypt-Subject is a directive for type 1 remailers that should be processed by the
remailer.
It should never make its way into Usenet.
This is a bug.
This is a client, this is a user messing up.
But I can't really blame them because type 1 is so horribly difficult.
There are over 10,000 messages like this.
And when you reuse the subject like these, you make messages without the Encrypt-Subject
stand out.
That's the one on the far right.
Or even worse.
Mess it up once.
And then figure out how to do it, but keep using that same subject and password.
So this let me identify 52 esub messages that were otherwise secure, but they messed
up once and sent it through in plain text.
And then there's Encrypt-Key, another header that should never make it into Usenet, but
does because type 1 remailers are so hard to use.
There are over 10,000 of these messages.
And let's look at another header.
Newsgroups.
Just like mailing lists, you can post a message to more than one newsgroup.
But if you do, you're wildly in the minority, and that segments you.
Like this newsgroup.
There are 34 messages posted with this newsgroup.
And thank you so much to Comcast for making your users extremely distinguishable.
And what about this value?
AAM with four commas at the end.
I thought this was a correlation.
But after tracking it down, it was actually a bug.
Call us back.
This one was sent by the remailer, remailer.org.uk for one week in January of 2006.
Just some random trivia I pulled out.
How about this one with duplicated newsgroups?
These were sent through a large variety of remailers and have no obvious correlation
besides this value and that they have English subjects.
So the English subjects were another example of the control that I used to confirm that
using a unique newsgroup is a bad idea.
So this is a good one.
And now humans are creatures of habit, and as flaky as remailers have been, a lot of
people find a configuration that works for them and then they stick with it.
Well, if I partition people by the remailer and the news gateway that they use, that's
what the colored squares are, what was previously an anonymous discussion thread suddenly makes
it very easy to pick out who is saying what and who is agreeing with themselves.
And it's even easier if I add in the header signature on the far right.
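The partitioning described above amounts to grouping a thread's messages by a fingerprint tuple. A minimal Python sketch follows, assuming each message carries `remailer`, `gateway`, and `headers` fields; all names and values here are hypothetical and not drawn from the talk's data.

```python
from collections import defaultdict

def partition_thread(messages, use_header_signature=False):
    """Group message ids by (remailer, gateway), optionally extended
    with the set of header names as a 'header signature'."""
    groups = defaultdict(list)
    for m in messages:
        fingerprint = (m["remailer"], m["gateway"])
        if use_header_signature:
            fingerprint += (frozenset(m["headers"]),)
        groups[fingerprint].append(m["id"])
    return dict(groups)

# Hypothetical sample thread, for illustration only.
thread = [
    {"id": 1, "remailer": "r1", "gateway": "g1", "headers": ["Subject", "X-No-Archive"]},
    {"id": 2, "remailer": "r2", "gateway": "g1", "headers": ["Subject"]},
    {"id": 3, "remailer": "r1", "gateway": "g1", "headers": ["Subject", "X-No-Archive"]},
    {"id": 4, "remailer": "r1", "gateway": "g1", "headers": ["Subject"]},
]
coarse = partition_thread(thread)
fine = partition_thread(thread, use_header_signature=True)
```

With only remailer and gateway, messages 1, 3, and 4 fall into one group; adding the header signature splits message 4 away from 1 and 3, which is the sense in which the extra column makes the partition sharper.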
And then here's a really interesting pattern that I observed.
There are a host of messages that have subjects with a one or a two in them, like soggy, soggy
two.
Well, I looked at these and found they were being posted really close together.
And then I realized one of the options in type one remailers is to duplicate a message
for redundancy.
Send the message down two different remailer chains just in case one becomes unavailable.
And while that gains you some measure of availability and redundancy, it's also quite distinguishing.
You could target a nym like I described earlier with huge messages, and if you see two huge
messages appear, well, you know that that nym's reply block duplicates the messages.
Then look for all the possible duplicate messages and you've got a candidate list of messages
to that nym, even if you're unsuccessful doing an hsub or an esub attack.
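Finding candidate duplicates like these could look roughly like the following Python sketch. Several things are assumptions: the message layout, the idea that the "1" or "2" appears as a trailing digit on the subject, and the 30-minute window (the talk only says the messages were posted really close together). The sample subjects and timestamps are made up.

```python
import re
from datetime import datetime, timedelta

def duplicate_pairs(messages, window=timedelta(minutes=30)):
    """Pair messages whose subjects differ only by a trailing 1 or 2
    and that were posted within `window` of each other."""
    by_base = {}
    for m in messages:
        # Strip a trailing "1" or "2" so "soggy" and "soggy 2" share a key.
        base = re.sub(r"\s*[12]$", "", m["subject"]).strip().lower()
        by_base.setdefault(base, []).append(m)
    pairs = []
    for group in by_base.values():
        group.sort(key=lambda m: m["posted"])
        for first, second in zip(group, group[1:]):
            close = second["posted"] - first["posted"] <= window
            if first["subject"] != second["subject"] and close:
                pairs.append((first["id"], second["id"]))
    return pairs

# Hypothetical sample data, for illustration only.
messages = [
    {"id": "a", "subject": "soggy", "posted": datetime(2006, 1, 1, 12, 0)},
    {"id": "b", "subject": "soggy 2", "posted": datetime(2006, 1, 1, 12, 10)},
    {"id": "c", "subject": "soggy", "posted": datetime(2006, 1, 3, 9, 0)},
    {"id": "d", "subject": "other", "posted": datetime(2006, 1, 1, 12, 5)},
]
pairs = duplicate_pairs(messages)
```

The resulting pairs are only candidates; a size check on the two bodies would be the natural next filter, since duplicated messages should be about the same size.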
And a similar pattern I saw was these.
Look at each pair of messages that are in the slightly different backgrounds.
The second message comes out of Dizum about five to six hours later than the one that
comes out of Panta Ray.
Now I don't know what this means, but it did stand out as distinguishing.
The subject for all of these was, again, weed will save the planet.
Also, some messages from Frel were mixed in with no obvious correlation to other messages.
So here's an example.
So there were a number of hypotheses I tried that did not turn up interesting data.
But there were more queries that could be run across this data set.
But I need to start wrapping up.
It all comes down to metadata.
What we saw in AAM is the obvious mistakes we kind of expect.
It also suffers a bit because we haven't taken into account the lessons that we've learned
in the 10 to 15 years since it was developed.
That's a lifetime in anonymity technology.
But I do think there are some traffic analysis lessons that we haven't codified as best
practice that we should.
So what does the future hold for AAM?
Well the security of a well-posted message is good with a lot of caveats.
If you use uncrackable pass phrases, only use servers that output key stretch packets,
post through remailers with no distinguishing characteristics, and you're willing to be
in a very small anonymity set, go for it.
I don't know how many people are using AAM today, but I don't think it's a lot.
What that means is if the government asks for a list of everyone who uses it, they could
probably get a really short list of names to dig fairly deeply into each of their lives.
And AAM crucially relies on remailers and news gateways.
And these services are dying.
Remember that two people, Zax and Dizum, post more than 98% of the traffic to AAM.
And it's also text-based.
Very limited bandwidth.
And the nymservs themselves,
they're pretty crappy, architecturally speaking.
We give single-hop proxies like VPNs and Ultrasurf a lot of shit because their architecture is
not nearly as strong as Tor's.
But nymservs are in that same category of trust this guy not to roll over on you.
I feel compelled to mention that the alternative is to use Tor, which you do trust, to send
e-mail via throwaway accounts on a service you do not trust.
And while this is a practice that pretty much everyone in this room has probably used or
at least thought of, it's also a really shitty architecture.
Now, the good news is we have something better.
We have a very strongly architected nymserv.
Pynchon Gate was developed by Len Sassaman, Bram Cohen, and Nick Mathewson and uses private
information retrieval instead of a shared mailbox.
It exposes less metadata and resists flooding or size-based correlation attacks.
However, it's not built.
It's been started, but it's got a very long way to go.
And it also requires a remailer network to operate.
And we don't really have a remailer network.
What we've got is Mixmaster and Mixminion.
Now, Mixminion is a bit better than Mixmaster, which doesn't have any link encryption, has
known attacks, uses old crypto with no chance of upgrading.
But both of these services suffer from the fact that we don't have a good solution to
remailer spam or abuse.
We don't have good documentation about them, and they both have horrible network diversity.
Under 25 people running Mixmaster, under five people running Mixminion.
So if we like Pynchon Gate, the path forward also involves fixing Mixminion, and Mixminion
needs love.
Mixminion is currently unmaintained, but we have a to-do list that includes the items
that I've got here.
Some of them are extremely complicated, like moving to a new packet format.
Others are relatively straightforward.
Like improving the system.
Others give you the opportunity to practice writing crypto, designing a distributed trust
directory system, or writing a complete standalone pinger in any language or style that you want.
So if you're interested, there are a lot of cool opportunities here.
But what I keep coming back to is the fact that we have no anonymity network that is
high bandwidth, high latency.
We have no anonymity network that would have let someone securely share the Collateral
Murder video without WikiLeaks being their proxy.
You can't take a video of corruption or police brutality and post it anonymously.
Now I hear you arguing with me in your heads, use Tor and upload it to YouTube.
No.
YouTube will take it down.
Use Tor and upload it to Mega or some site that will fight fraudulent takedown notices.
Okay, but now you're relying on the good graces of a third party, a third party that
is known to host the video and can take it down.
And you can say hidden services, and I'll point to size-based traffic analysis and confirmation
attacks that come with a low latency network, never mind Ralf-Philipp Weinmann's recent paper that
pretty much killed hidden services.
We can go on and on like this, but I hope you'll at least concede the point that what
you're coming up with are workarounds for a problem that we lack a good solution to.
So if I've been able to entertain you, I'm glad.
If I've been able to inspire you to work on anonymity systems, I'm overjoyed.
If you want a place to start, I will point you there.
Thank you.
